DIADEM is R package for differential analysis of Hi-C data. It takes advantage of significant linear correlations of main diagonals between different Hi-C data sets (cell lines, experiments, etc.) - usually the first 100 diagonals from the main diagonal are considered. DIADEM uses GLM to model these dependancies and then quantifies deviations from the model in a probabilistic way.

This vignette contains the description of our method:

If you want to spare yourself extra reading and jump straight to analysis part, please see the quick start chapter. It contains a step-by-step introduction to Hi-C differential analysis with our package using an example Hi-C data set.

1 Quick Start

Load the package:

library("DIADEM")

To compare Hi-C experiments, you will need 2 files with Hi-C matrices. In this tutorial, we will use Hi-C data sets of human IMR90 and human MSC cells lines from GSE63525 and GSE52457 studies respectively available as sample data in DIADEM package. For simplicity we will only use chromosome 18. First, we read the data:

mtx.fname.imr90 <- system.file("extdata", "IMR90-MboI-1_40kb-raw.npz",
                               package = "DIADEM", mustWork = TRUE)
mtx.fname.msc <- system.file("extdata", "MSC-HindIII-1_40kb-raw.npz",
                             package = "DIADEM", mustWork = TRUE)

hic.comparator <- HiCcomparator(mtx.fname.imr90, mtx.fname.msc, mtx.names = c("18"), do.pca = TRUE)

1.1 Contact maps and A/B compartments

Illustrate Hi-C contact maps with A/B compartments (feel free to adjust margins as you want):

dense.imr90 <- sparse2dense(hic.comparator$maps1[["18"]][c("i","j","val")],
                            N = hic.comparator$maps.dims[["18"]][1,"n.rows"])
dense.msc <- sparse2dense(hic.comparator$maps2[["18"]][c("i","j","val")],
                          N = hic.comparator$maps.dims[["18"]][2,"n.rows"])

plot.margins <- 0.1
n.plots <- 4

par(mfrow = c(2,2), mai = rep(plot.margins, n.plots))

plot_contact_map(dense.imr90)
plot_contact_map(dense.msc)
plot_pc_vector(hic.comparator$pc1.maps1[["18"]])
plot_pc_vector(hic.comparator$pc1.maps2[["18"]])
IMR90 and MSC contact maps with A/B compartments of human chromosome 18.

Figure 1.1: IMR90 and MSC contact maps with A/B compartments of human chromosome 18.

1.2 Contact maps and TADs

Determine TADs for both Hi-C maps and illustrate them:

tads.imr90 <- map2tads(dense.imr90, resolution = 40*1000, window.bp = 400*1000, delta.bp = 80*1000)
tads.msc <- map2tads(dense.msc, resolution = 40*1000, window.bp = 400*1000, delta.bp = 80*1000)

plot_with_inset(list(dense.imr90), c(600,900), c(600,900), args.regions = list(tads.imr90))
plot_with_inset(list(dense.msc), c(600,900), c(600,900), args.regions = list(tads.msc))
IMR90 contact map of human chromosome 18 with TADs.

Figure 1.2: IMR90 contact map of human chromosome 18 with TADs.

MSC contact map of human chromosome 18 with TADs.

Figure 1.3: MSC contact map of human chromosome 18 with TADs.

1.3 Coverage and decays

Compare coverage and decays for IMR90 and MSC, for decays use log10 scales for both X and Y axis:

library("ggplot2")

coverage <- dominating_signal(hic.comparator, which.signal = "coverage")

ggplot(coverage, aes(x = i, y = sum.contacts, color = dataset)) +
  geom_point(size = 0.5) +
  geom_smooth(alpha = 0.5) +
  facet_wrap(~ name, ncol = 1, scales = "free") +
  theme(legend.position = "bottom")

decay <- dominating_signal(hic.comparator, which.signal = "decay")

ggplot(decay[decay$diagonal != 0,], aes(x = diagonal, y = mean.contacts, color = dataset)) +
  geom_point(size = 0.5) +
  scale_x_log10() +
  scale_y_log10() +
  facet_wrap(~ name, ncol = 1, scales = "free") +
  theme(legend.position = "bottom")
Coverage comparison between IMR90 and MSC Hi-C data of human chromosome 18.

Figure 1.4: Coverage comparison between IMR90 and MSC Hi-C data of human chromosome 18.

Decay comparison between IMR90 and MSC Hi-C data of human chromosome 18.

Figure 1.5: Decay comparison between IMR90 and MSC Hi-C data of human chromosome 18.

1.4 Fold Change and Differential maps

First, calculate the fold change and differential maps IMR90 / MSC and IMR90 - MSC respectively. Then, illustrate results with zoomin of 600 - 900 bins region. By default, when plotting fold change maps log10 scale is used. When plotting the differential map, one should indicate square root transformation of data for better visibility (see examples below):

merged <- merge(hic.comparator)
merged.18 <- merged[["18"]]

merged.18$fc <- merged.18$val.x / merged.18$val.y
merged.18$difference <- merged.18$val.x - merged.18$val.y

dense.fc <- sparse2dense(merged.18[c("i","j","fc")], N = hic.comparator$maps.dims[["18"]][1,"n.rows"])
dense.diff <- sparse2dense(merged.18[c("i","j","difference")],
                           N = hic.comparator$maps.dims[["18"]][1,"n.rows"])

plot_with_inset(list(dense.fc, fc = TRUE, colors.pal = c("blue","white","red")), c(600,900), c(600,900), which.map = "contact.map")
plot_with_inset(list(dense.diff, breaks = 100, sqrt.transform = TRUE), c(600,900), c(600,900), which.map = "diff.map")
Fold Change map of IMR90 / MSC of human chromosome 18.

Figure 1.6: Fold Change map of IMR90 / MSC of human chromosome 18.

Differential map of IMR90 - MSC of human chromosome 18.

Figure 1.7: Differential map of IMR90 - MSC of human chromosome 18.

1.5 Diagonal correlations

Check the correlations between corresponding diagonals of IMR90 and MSC:

library("reshape2")

decay.cors <- decay_correlation(hic.comparator)
# wide to long
decay.cors.long <- reshape2::melt(decay.cors[c("name","diagonal","pcc","rho","tau")],
                                  id.vars = c("name","diagonal"), variable.name = "correlation",
                                  value.name = "coefficient")
# remove 0 diagonal (as it is non informative anyways) and illustrate results
ggplot(decay.cors.long[decay.cors.long$diagonal != 0,],
       aes(x = diagonal, y = coefficient, color = correlation)) +
  geom_point(size = 0.7) +
  scale_x_continuous(breaks = c(1,seq(0, max(decay.cors.long$diagonal), by = 250)[-1])) +
  facet_wrap(~ name, ncol = 1, scales = "free") +
  theme(legend.position = "bottom")

dc <- decay.cors[c("name","diagonal","pearson.pval","spearman.pval","kendall.pval")]
colnames(dc) <- c("name","diagonal","pearson","spearman","kendall")
decay.sig.long <- reshape2::melt(dc, id.vars = c("name","diagonal"),
                                 variable.name = "correlation", value.name = "pval")
decay.sig.long$neg.log.pval <- -log10(decay.sig.long$pval)

ggplot(decay.sig.long[decay.sig.long$diagonal != 0,],
       aes(x = diagonal, y = neg.log.pval, color = correlation)) +
  geom_point(size = 0.7) +
  geom_hline(yintercept = -log10(0.01)) +
  annotate("text", 1800, -log10(0.01), vjust = -1, label = "pval = 0.01") +
  scale_x_continuous(breaks = c(1,seq(0, max(decay.sig.long$diagonal), by = 250)[-1])) +
  ylab("-log10(pval)") +
  facet_wrap(~ name, ncol = 1, scales = "free") +
  theme(legend.position = "bottom")
Correlations between diagonals of IMR90 and MSC Hi-C data of human chromosome 18.

Figure 1.8: Correlations between diagonals of IMR90 and MSC Hi-C data of human chromosome 18.

Significances of correlations between diagonals of IMR90 and MSC Hi-C data of human chromosome 18.

Figure 1.9: Significances of correlations between diagonals of IMR90 and MSC Hi-C data of human chromosome 18.

1.6 GLM modelling of chromatin interactions dependancy

Construct Hi-C glm models:

hic.glm <- HiCglm(hic.comparator, diagonals = 1:150)

1.6.1 Background model

Plot expected vs observed plot of E[Y = MSC | X = IMR90]:

plots.eo <- plotHiCglm(hic.glm, ncol = 10, diagonals = as.character(1:100))

plots.eo[["18"]][["Y | X"]]
Expected vs observed interaction dependency (Y = MSC | X = IMR90). Each facet illustrates separate diagonal.

Figure 1.10: Expected vs observed interaction dependency (Y = MSC | X = IMR90). Each facet illustrates separate diagonal.

1.6.2 Differential (significance) matrix

When the background model is constructed differential map can be computed (and visualised):

diffmap <- hicdiff(hic.glm)
diffmap.dense <- hicdiff2mtx(diffmap, hic.glm$maps.dims)

# plot diffmap
diffmap.18.dense <- diffmap.dense[["18"]]
plot_with_inset(list(diffmap.18.dense, breaks = 100), c(1300,1700), c(1300,1700))
Differential/Significance map of human chromosome 18 IMR90 vs MSC cell lines.

Figure 1.11: Differential/Significance map of human chromosome 18 IMR90 vs MSC cell lines.

1.6.3 Detection of Long Range differential interactions

Finally groups of cells with significant depletion/enrichment can be identified (with some precision) as rectangle-like regions containing connected components:

lrdi <- differential_interactions(hic.glm)

# get rectangle like regions
regions <- lrdi[["18"]]$interacting.regions
# plot them --> only those which contain at least 5 cells inside connected component
plot_with_inset(list(diffmap.18.dense, breaks = 100), c(1300,1700), c(1300,1700),
                args.regions = list(regions[regions$n.cells >= 5, 2:5], lwd = 2))
plot_with_inset(list(diffmap.18.dense, breaks = 100), c(1700,1900), c(1700,1900),
                args.regions = list(regions[regions$n.cells >= 5, 2:5], lwd = 2))
Detection of differential long range interactions in IMR90 vs MSC map. Significances threshold in E[Y = MSC | X = IMR90] is determined using bilinear model.

Figure 1.12: Detection of differential long range interactions in IMR90 vs MSC map. Significances threshold in E[Y = MSC | X = IMR90] is determined using bilinear model.

Detection of differential long range interactions in IMR90 vs MSC map. Significances threshold in E[X = IMR90 | Y = MSC] is determined using bilinear model.

Figure 1.13: Detection of differential long range interactions in IMR90 vs MSC map. Significances threshold in E[X = IMR90 | Y = MSC] is determined using bilinear model.

Differential/Significance map of human chromosome 18, IMR90 vs MSC cell lines with long range differential interactions (zoomin 1).

Figure 1.14: Differential/Significance map of human chromosome 18, IMR90 vs MSC cell lines with long range differential interactions (zoomin 1).

Differential/Significance map of human chromosome 18, IMR90 vs MSC cell lines with long range differential interactions (zoomin 2).

Figure 1.15: Differential/Significance map of human chromosome 18, IMR90 vs MSC cell lines with long range differential interactions (zoomin 2).

2 About the Hi-C protocol

Hi-C is the high-resolution variant of the 3C (Chromosome Conformation Capture) technique, which samples chromatin interactions genome-wide (Lieberman-Aiden et al. 2009). The result of the Hi-C experiment is a library of paired-end reads indicating pairs of chromatin regions that were in close 3D proximity in initial sample (i.e. interacting regions). This library is then mapped to reference genome to locate interacting regions and interactions are counted and summarized in the form of non negative matrices called contact maps. To construct the contact map, one must first specify the resolution parameter a number of basepairs per unit (typically between mega- and kilo-basepairs). Correct selection of the resolution parameter was studied in the literature and there are some guidelines as to how to approach this problem (Lajoie, Dekker, and Kaplan 2015; Yang et al. 2017). A contact map cell with coordinates i,j contains a number of interactions between genomic regions i and j inside the sample captured in the experiment. An important note here is that in most cases the Hi-C protocol is performed on a sample containing billions of cells (rather than a single, isolated cell) and therefore the resulting contact maps contain sums of interactions over possibly distinct cell populations. There are 2 types of contact maps: inter- and intrachromosome. Usually due to the effect called chromosome territories leading to very low contact counts between chromosomes of interest are only intrachromsome matrices.
Contact map of human chromosome 18 IMR90 cell line in 40kb resolution.

Figure 2.1: Contact map of human chromosome 18 IMR90 cell line in 40kb resolution.

Hi-C is a complex protocol with many sources of bias (Yaffe and Tanay 2011). This makes the analysis of this type of data very challenging. Therefore, any biological study utilizing Hi-C data performed without taking these biases into account will lead to severely biased results.

Here we consider a problem of comparing 2 Hi-C datasets. Depending on experimental design it may be, for example, different cell lines or different experimental conditions. This sort of studies apper frequently in literature making the problem of countering biases important (Le Dily et al. 2014; Dixon et al. 2015; Lun and Smyth 2015). We present key issues arising in this type of assays and suggest a model, which have potential to improve the reliability of Hi-C differential analysis.

3 Hi-C differential analysis

The Hi-C differential analysis aims to discover and quantify regions that are significantly depleted or enriched in interactions between 2 contact maps of interest. First thing we should note is that when we compare two raw Hi-C datasets, both the read coverage and distance dependant contact decay will be significantly more diverse between samples then the actual cell-specific differences. This makes the direct comparisons of contact maps only suitable on raw data if one is interested in these global trends.
Coverages (sum of contacts per region) of human chromosome 19 from 2 Hi-C experiments: IMR90 and MSC cell lines.

Figure 3.1: Coverages (sum of contacts per region) of human chromosome 19 from 2 Hi-C experiments: IMR90 and MSC cell lines.

Decays (mean of contacts per diagonal) of human chromosome 19 from 2 Hi-C experiments: IMR90 and MSC cell lines.

Figure 3.2: Decays (mean of contacts per diagonal) of human chromosome 19 from 2 Hi-C experiments: IMR90 and MSC cell lines.

As an example, we take the scenario where in the first Hi-C experiment sequencing depth is 3 times that of the second experiment. Obviously, the naive comparison will lead to the conclusion that almost any pair of regions is enriched in interactions in experiment 1 with respect to 2.
Fold Change map of IMR90 / MSC of human chromosome 19.

Figure 3.3: Fold Change map of IMR90 / MSC of human chromosome 19.

This leads to, now standard techniques that focus on normalizing the coverage across chromosomal length. One prime example of such normalization is the iterative correction method (Imakaev et al. 2012) that assumes that the read coverage across the genome should be uniform. As we can see in fig. 3.4 it is quite effective in reducing the difference in total coverage.
Coverage (sum of contacts per region) of human chromosome 19 from 2 Hi-C experiments: IMR90 and MSC cell lines after Iterative Correction.

Figure 3.4: Coverage (sum of contacts per region) of human chromosome 19 from 2 Hi-C experiments: IMR90 and MSC cell lines after Iterative Correction.

However it does not seem to reduce the differences in distance dependant contact decay. As we can see in fig. 3.5 this bias is still strikingly different in the normalized data.
Decays (mean of contacts per diagonal) of human chromosome 19 from 2 Hi-C experiments: IMR90 and MSC cell lines after Iterative Correction.

Figure 3.5: Decays (mean of contacts per diagonal) of human chromosome 19 from 2 Hi-C experiments: IMR90 and MSC cell lines after Iterative Correction.

Fold Change map of IMR90 / MSC of human chromosome 19 with both maps iteratively corrected.

Figure 3.6: Fold Change map of IMR90 / MSC of human chromosome 19 with both maps iteratively corrected.

What is worse, different normalization methods can lead to different biases in the decay rate. As we can see in fig. 3.7 two well-known and widely used normalization methods HiCNorm (Hu et al. 2012) and ICE yield opposite relationship between contact decays of IMR90 and MSC datasets.
Comparison of contact decays (mean of contacts per diagonal) of human chromosomes 18 and 19 from 2 Hi-C experiments: IMR90 and MSC cell lines between HiCNorm and Iterative Correction normalization methods.

Figure 3.7: Comparison of contact decays (mean of contacts per diagonal) of human chromosomes 18 and 19 from 2 Hi-C experiments: IMR90 and MSC cell lines between HiCNorm and Iterative Correction normalization methods.

A possible workaround could be the normalization avoiding method, which tries to model similarities in coverage and contact decay between both Hi-C experiments simultaneously and then seeks for deviations from the model.

4 Modelling dependency between Hi-C datasets

Our method is based on an observation that the pattern of interactions between corresponding diagonals of Hi-C maps is similar and statistically significant for pairs of regions located nearby as measured by linear chromosomal distance. As an example consider below images illustrating Pearson, Spearman and Kendall correlation between pairs of consecutive diagonals from pair of contact maps. To calculate the correlations, only the cells containing interactions in both experiments (contact maps) are used.
Correlations between corresponding diagonals (i.e. 1 vs 1, 2 vs 2, etc) of IMR90 replicate 1 vs replicate 2 Hi-C contact map of human chromosome 19.

Figure 4.1: Correlations between corresponding diagonals (i.e. 1 vs 1, 2 vs 2, etc) of IMR90 replicate 1 vs replicate 2 Hi-C contact map of human chromosome 19.

Significances of the above correlations.

Figure 4.2: Significances of the above correlations.

Although correlations between experiments originating from different cell lines are much lower they are still highly significant indicating of similarities of the interactions pattern.
Correlations between corresponding diagonals (i.e. 1 vs 1, 2 vs 2, etc) of IMR90 vs MSC Hi-C contact map of human chromosome 19.

Figure 4.3: Correlations between corresponding diagonals (i.e. 1 vs 1, 2 vs 2, etc) of IMR90 vs MSC Hi-C contact map of human chromosome 19.

Significances of diagonals correlations.

Figure 4.4: Significances of diagonals correlations.

The trends shown above are representative of many other cell types and chromosomes analyzed by us.

After more thorough inspection of interaction patterns we hypothesized that the strength of interactions in contact map B dependes linearly on the strength of coresponding interaction (i.e. with the same coordinates) in contact map A for most of Hi-C contacts up to some distance between interacting regions. There is however a little fraction of interacting regions, which exhibit large deviation from linearity and are potential differential contacts. Based on this observation we assume that most of chromatin interactions is preserved even between different Hi-C datasets and their similarity can be modelled using linear model thus comprising null model. Due to apparent heteroscedasticity in observed dependency we choose to model similarity pattern using Negative Binomial regression. Our algorithm can be summarized in following steps: 1. first we fit IRLS regression to select potential differential interactions - points with weights equal to 0 are marked as outliers and are removed from fitting NB background model, 2. then we fit NB regression using data left after outlier removal, 3. afterwords we calculate significance of each point (including outliers) given the model, 4. finally we adjust pvalues using Benjamini-Hochberg method. Our package is based on robustreg and MASS packages, which handles fitting robust regression (IRLS) and GLM respectively. The pipeline can be summarized with below panels.
Input, a pair of Hi-C contact maps to be compared.

Figure 4.5: Input, a pair of Hi-C contact maps to be compared.

Model selection and p-value calculation.

Figure 4.6: Model selection and p-value calculation.

Resulting significance map.

Figure 4.7: Resulting significance map.

As can be seen on the last panel our model detects interactions that form large clusters of significant neighboring cells, which seems to indicate nonrandom effect.

5 Detection of differential significantly interacting regions

Given differential (significance) map often of interest are groups of enriched cells, which are close neighbors (in terms of euclidean distance in i, j, p-value space) for some p-value cutoff. To find such aggregates of cells we designed a simple algorithm based on connected components search. Below panel with pseudocode explains main steps of this procedure.
Selection of significantly enriched regions.

Figure 5.1: Selection of significantly enriched regions.

Briefly, the algorithm first selects non zero cells from significance matrix, sorts them and fits a bilinear model. An intersection point between linear functions derived from the model indicate threshold (red vertical line), which divides cells on significant and non significant differentially interacting regions. Additionally p-value threshold (grey vertical line) is fixed (by default to 0.01) to prohibit the selection of non significant cells in cases when the bilinear model assumption is violated.
Selection of significantly enriched regions, bilinear model fitting.

Figure 5.2: Selection of significantly enriched regions, bilinear model fitting.

Selection of significantly enriched regions, bilinear model fitting.

Figure 5.3: Selection of significantly enriched regions, bilinear model fitting.

In the last step a zero matrix with the same shape as initial significance matrix is constructed. Cell coordinates corresponding to that of significant ones are marked with 1. Resulting binary matrix is scanned for connected components. To perform a connected components search, the clump function from the raster package is used (Hijmans 2018). The algorithm outputs 2 elements:

By providing such an output, we give the users flexibility to individually filter resulting regions by the size of connected components (i.e. number of significant cells within component). An example output illustrated on significance map is presented below for 2 thresholds: components with at least 10 significant cells and components containing at least 20 significant cells.
Long range differential interactions with connected components containing at least 5 significant cells.

Figure 5.4: Long range differential interactions with connected components containing at least 5 significant cells.

Long range differential interactions with connected components containing at least 10 significant cells.

Figure 5.5: Long range differential interactions with connected components containing at least 10 significant cells.

References

Dixon, Jesse R, Inkyung Jung, Siddarth Selvaraj, Yin Shen, Jessica E Antosiewicz-Bourget, Ah Young Lee, Zhen Ye, et al. 2015. “Chromatin Architecture Reorganization During Stem Cell Differentiation.” Nature 518 (7539). Nature Publishing Group: 331.

Hijmans, Robert J. 2018. Raster: Geographic Data Analysis and Modeling. https://CRAN.R-project.org/package=raster.

Hu, Ming, Ke Deng, Siddarth Selvaraj, Zhaohui Qin, Bing Ren, and Jun S Liu. 2012. “HiCNorm: Removing Biases in Hi-c Data via Poisson Regression.” Bioinformatics 28 (23). Oxford University Press: 3131–3.

Imakaev, Maxim, Geoffrey Fudenberg, Rachel Patton McCord, Natalia Naumova, Anton Goloborodko, Bryan R Lajoie, Job Dekker, and Leonid A Mirny. 2012. “Iterative Correction of Hi-c Data Reveals Hallmarks of Chromosome Organization.” Nature Methods 9 (10). Nature Publishing Group: 999.

Lajoie, Bryan R, Job Dekker, and Noam Kaplan. 2015. “The Hitchhiker’s Guide to Hi-c Analysis: Practical Guidelines.” Methods 72. Elsevier: 65–75.

Le Dily, François, Davide Baù, Andy Pohl, Guillermo P Vicent, François Serra, Daniel Soronellas, Giancarlo Castellano, et al. 2014. “Distinct Structural Transitions of Chromatin Topological Domains Correlate with Coordinated Hormone-Induced Gene Regulation.” Genes & Development 28 (19). Cold Spring Harbor Lab: 2151–62.

Lieberman-Aiden, Erez, Nynke L Van Berkum, Louise Williams, Maxim Imakaev, Tobias Ragoczy, Agnes Telling, Ido Amit, et al. 2009. “Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome.” Science 326 (5950). American Association for the Advancement of Science: 289–93.

Lun, Aaron TL, and Gordon K Smyth. 2015. “DiffHic: A Bioconductor Package to Detect Differential Genomic Interactions in Hi-c Data.” BMC Bioinformatics 16 (1). BioMed Central: 258.

Yaffe, Eitan, and Amos Tanay. 2011. “Probabilistic Modeling of Hi-c Contact Maps Eliminates Systematic Biases to Characterize Global Chromosomal Architecture.” Nature Genetics 43 (11). Nature Publishing Group: 1059.

Yang, Tao, Feipeng Zhang, Galip Gurkan Yardimci, Fan Song, Ross C Hardison, William Stafford Noble, Feng Yue, and Qunhua Li. 2017. “HiCRep: Assessing the Reproducibility of Hi-c Data Using a Stratum-Adjusted Correlation Coefficient.” Genome Research. Cold Spring Harbor Lab, gr–220640.